Object based coding syntax
Video object 

creates a coded representation of a VOP as well as composition information necessary for display. Further, at
the decoder, a user may interact with and modify the composition process as needed.

The full syntax allows coding of rectangular as well as arbitrarily shaped video objects in a scene. Furthermore, the
syntax supports both nonscalable coding and scalable coding. Thus it becomes possible to handle normal
scalabilities as well as object based scalabilities. The scalability syntax enables the reconstruction of useful video
from pieces of a total bitstream. This is achieved by structuring the total bitstream in two or more layers, starting
from a standalone base layer and adding a number of enhancement layers. The base layer can be coded using a
non-scalable syntax, or in the case of picture based coding, even using a syntax of a different video coding
standard.

To ensure the ability to access individual objects, it is necessary to achieve a coded representation of its shape. A
natural video object consists of a sequence of 2D representations (at different points in time) referred to here as
VOPs. For efficient coding of VOPs, both temporal redundancies as well as spatial redundancies are exploited. Thus
a coded representation of a VOP includes representation of its shape, its motion and its texture.

Face object 

A 3D (or 2D) face object is a representation of the human face that is structured for portraying the visual
manifestations of speech and facial expressions adequate to achieve visual speech intelligibility and the recognition
of the mood of the speaker. A face object is animated by a stream of face animation parameters (FAP) encoded for
low-bandwidth transmission in broadcast (one-to-many) or dedicated interactive (point-to-point) communications.
The FAPs manipulate key feature control points in a mesh model of the face to produce animated visemes for the
mouth (lips, tongue, teeth), as well as animation of the head and facial features like the eyes. FAPs are quantized
with careful consideration for the limited movements of facial features, and then prediction errors are calculated and
coded arithmetically. The remote manipulation of a face model in a terminal with FAPs can accomplish lifelike
visual scenes of the speaker in real-time without sending pictorial or video details of face imagery every frame.

A simple streaming connection can be made to a decoding terminal that animates a default face model. A more
complex session can initialize a custom face in a more capable terminal by downloading face definition parameters
(FDP) from the encoder. Thus specific background images, facial textures and head geometry can be portrayed.
The composition of specific backgrounds, face 2D/3D meshes, texture attribution of the mesh etc. is described in
ISO/IEC 14496-1. The FAP stream for a given user can be generated at the user's terminal
from text-to-speech. FAPs can be encoded at bitrates up to 2-3 kbit/s at necessary speech rates. Optional temporal
DCT coding provides further compression efficiency in exchange for delay. Using the facilities of ISO/IEC 14496-1,
a composition of the animated face model and synchronized, coded speech audio (low-bitrate speech coder or
text-to-speech) can provide an integrated low-bandwidth audio/visual speaker for broadcast applications or interactive
conversation.

Limited scalability is supported. Face animation achieves its efficiency by employing very concise motion animation
controls in the channel, while relying on a suitably equipped terminal for rendering of moving 2D/3D faces with
non-normative models held in local memory. Models stored and updated for rendering in the terminal can be simple or
complex. To support speech intelligibility, the normative specification of FAPs intends for their selective or complete
use as signaled by the encoder. A masking scheme provides for selective transmission of FAPs according to what
parts of the face are naturally active from moment to moment. A further control in the FAP stream allows face
animation to be suspended while leaving face features in the terminal in a defined quiescent state for higher overall
efficiency during multi-point connections.

The Face Animation specification is defined in ISO/IEC 14496-1 and this part of ISO/IEC 14496. This clause is
intended to facilitate finding various parts of specification. As a rule of thumb, FAP specification is found in the part
2, and FDP specification in the part 1. However, this is not a strict rule. For an overview of FAPs and their
interpretation, read subclauses "6.1.5.2 Facial animation parameter set", "6.1.5.3 Facial animation parameter units",
"6.1.5.4 Description of a neutral face" as well as the Table C-1. The viseme parameter is documented in subclause
"7.12.3 Decoding of the viseme parameter fap 1" and the Table C-5 in annex C. The expression parameter is
Motion representation - macroblocks

overhead that can be afforded. 

Depending on the type of the macroblock, motion vector information and other side information is encoded with the
compressed prediction error in each macroblock. The motion vectors are differenced with respect to a predicted
value and coded using variable length codes. The maximum length of the motion vectors allowed is decided at the
encoder. It is the responsibility of the encoder to calculate appropriate motion vectors. The specification does not
specify how this should be done.

Spatial redundancy reduction 

w, »— vo« -d » «» TrrSS "SUSSES' HZSS?£!F£!""2Z 

remaining DCT coefficients efficiently. 
Chrominance formats 

This part of ISO/IEC 14496 currently supports the 4:2:0 chrominance format.
Pixel depth 

This part of ISO/IEC 14496 supports pixel depths between 4 and 12 bits in luminance and chrominance planes.
Generalized scalability 

The scalability tools in this par, of .SC.EC 14496 are < designed 1°-^ 

single layer video. The major appl,cat,o ns ofj ^^^^"either normal scalabilities on picture bas.s 

ttizris^™™m& based sea,abi,i,ies may be necessary: 

categories of scalability are enabled by this part of ISO/lEC 14496. 
Mhough a simple so-ution to sca^e videois ^ 

multiple independently coded '^> raduc ^.^ n '^ a ^ ^bTnrttnl^uiliMd in coding of the next reproduct.on 
the bandwidth allocated to a given reprodudon o v 'f be o ^ Stlm, decoders of various complexities can 
of video. In scalable video coding, rt « ^.^^S vkleo encoder is likely to have increased 
decode and display appropnate reproducing of coded v,d ~ w * ^ a a ° „ of , so/ , EC 14496 provides several 

The bas* sca.abi.ity «oo,s offered are tern pora , ^ .and — .^S^ta^ltyS 

up to four layers are supported. 
Object based Temporal scalability 
Tempore, s^ityisatoo. intended for use ina = 

Tempera, scalable invo.ves phoning of ^^-^ 

basic temporal rate and the enhancemen ^J^^^^^^^Mm. The lower temporal resolufon 

1^- whereas enhanced sy8,ems 
future may support both layers. Furthermore. ««P^«*^

applications where adaptation to frequent changes ''^^^^JE^ important data of the lower 
temporal scalability is its ability to provide resihence to ,,ansm ^^"™^L| enhancement layer can be 
iayer can be sent over a channel with better error f^^^^^^ff^ aCbe employed to allow 
sent over a channe. with poor error p^omw»L Ob,ect based «^ Qf a ven 

graceful control of picture quality by .controlling the temporal rate ot eacn viaeo ooieci 



bit-budget. 
Spatial scalability 



spatial resolution of the input video source. 

interoperability between various standards. 
Hybrid scalability 

multi viewpoi nt/stereoscopic cod i ng etc . 
Error Resilience 

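This part of ISO/IEC 14496 provides error robustness and resilience to allow accessing of image or video
information over a wide range of storage and transmission media. The error resilience tools developed for this part
of ISO/IEC 14496 can be divided into three major categories. These categories include synchronization, data
recovery, and error concealment. It should be noted that these categories are not unique to this part of ISO/IEC
14496, and have been used elsewhere in general research in this area. It is, however, the tools contained in these
categories that are of interest, and where this part of ISO/IEC 14496 makes its contribution to the problem of error
resilience.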
Patents 



The International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) draw
attention to the fact that it is claimed that compliance with this part of ISO/IEC 14496 may involve the use of patents
concerning the coded representation of picture information given in Annex H.

ISO and IEC take no position concerning the evidence, validity and scope of these patent rights. 

The holders o. these paten, rights have assured ISO and J^^^^^^^^^ 

the patent offices of the organizations listed in Annex H. 
such patent rights. 
• IEEE Standard Specifications for the Implementations of 8 by 8 Inverse Discrete Cosine Transform, IEEE
Std 1180-1990, December 6, 1990.

• IEC Publication 908:1987, CD Digital Audio System.

• IEC Publication 461:1986, Time and control code for video tape recorder.

• ITU-T Recommendation H.261 (Formerly CCITT Recommendation H.261), Codec for audiovisual services
at px64 kbit/s.

• ITU-T Recommendation H.263, Video Coding for Low Bitrate Communication.
3 Definitions 

3.1 AC coefficient: Any DCT coefficient for which the frequency in one or both dimensions is non-zero. 

3.2 B-VOP; bidirectionally predictive-coded video object plane (VOP): A VOP that is coded using 
motion compensated prediction from past and/or future reference VOPs. 

3.3 backward compatibility: A newer coding standard is backward compatible with an older coding 
standard if decoders designed to operate with the older coding standard are able to continue to operate 
by decoding all or part of a bitstream produced according to the newer coding standard. 

3.4 backward motion vector: A motion vector that is used for motion compensation from a reference VOP 
at a later time in display order. 

3.5 backward prediction: Prediction from the future reference VOP. 

3.6 base layer: An independently decodable layer of a scalable hierarchy. 

3.7 binary alpha block: A block of size 16x16 pels, colocated with macroblock, representing shape 
information of the binary alpha map; it is also referred to as a bab. 

3.8 binary alpha map: A 2D binary mask used to represent the shape of a video object such that the
pixels that are opaque are considered as part of the object whereas pixels that are transparent are not
considered to be part of the object.

3.9 bitstream; stream: An ordered series of bits that forms the coded representation of the data. 

3.10 bitrate: The rate at which the coded bitstream is delivered from the storage medium or network to the 
input of a decoder. 

3.11 block: An 8-row by 8-column matrix of samples, or 64 DCT coefficients (source, quantised or 
dequantised). 

3.12 byte aligned: A bit in a coded bitstream is byte-aligned if its position is a multiple of 8-bits from the first 
bit in the stream. 

3.13 byte: Sequence of 8-bits. 

3.14 context based arithmetic encoding: The method used for coding of binary shape; it is also referred
to as cae.

3.15 channel: A digital medium or a network that stores or transports a bitstream constructed according to
ISO/IEC 14496.

3.16 chrominance format: Defines the number of chrominance blocks in a macroblock.
3.41
3.42 



3.43 
3.44 

3.45 

3.46 



3.53 

3.54 
3.55 



mI!»L T ^ "T eSS * If* ° ne ° r m0re COded are manipulated to produce a new coded 

Nstream. Conferring ed,,ed bits.reams must meet the requirements defined in this par, otTso/Jc 

encoder: An embodiment of an encoding process. 

encoding (process): A process, not specified in this oart of isn/ipr i^ Qfi 

inures or audio samp.es and produces a J^SJ^J^?^J££ 
ESS"" mMh: D8fini,i0n °' 3 30 meSh '°' of the shape and structure o, a baseline 

srstsrE: 0 ? t0 : da r° " ze a , ~— *» «- * - 

animate it. The FDPs are formal y tran^nl ° ^ ^ into ^tion about how to 

FAPs. FDPs may include h£ Z fitted once per sess.on, followed by a stream of compressed 

map it onto the Z^SXS^ ° 8 baS6 ' ine *"* ^ ^ C00rdina,es t0 

sn^jfis. i~ e by ve F ^ in h a 861 °' such »"*» **» - 

baseline face. by FAPs and ,hat allow for ""bration of the shape of the 

polynomial .unions 9 2^2T^^n;T^ n POin,S • ' hr ° U9h Weigh,ed ra,ional 

proprietary face models. c «»s-coupl,ng of standard FAPs to link their effects into custom or 

3^s : eJj^js£r ssr Th sh h de , ,ined by ve,iices and P,a " ar 

normals). b 6 '° r rendenn 9 with photometric attributes (e.g. texture, color, 

* ,0 °' *" ,aPerS ^ ValU6S ar0U " d ed0eS of mask for composition with the 

flag: A one bit integer variable which may take one of only two values (zero and one). 



3.48 

3.49 
3.50 



3.51 



3.52 




© ISO/IEC

ISO/IEC 14496-2:1999(E)



3.57 
3.58 

3.59 

3.60 
3.61 



3.63 
3.64 

3.65 

3.66 
3.67 

3.68 
3.69 
3.70 
3.71 
3.72 

3.73 
3.74 
3.75 



older coding standard • ~ COd ' n9 S,a " da,d 8re ab,e ,0 decode bitstreams of th 

I^-n^ZE ^ " '°' ^ C ° mPen8aH0 " a *™ 

forward prediction: Prediction from the past reference VOP. 

3.62 frame period: The reciprocal of the frame rate. 

frame rate: The rate at which frames are output from the composition process.

^ETXgJlST re,erence vop is a re ' erence VOP ,ha * «"» at a *» - 

hybrid scalability: Hybrid scalability is the combination of two (or more) types of scalability.

--.edthe,^^^ 

I-VOP; intra-coded VOP: A VOP coded using information only from itself.

intra coding: Coding of a macroblock or VOP that uses information only from that macroblock or VOP.
intra shape coding: Shape coding that does not use any temporal prediction.
inter shape coding: Shape coding that uses temporal prediction.

.'so/Ik ^J^.^ST °" T, e "IT WhiCh may * ,ate " by ,he P*™*- o« «. Pan ol 
-e^le^^ - - -S. ,o a dLren, 

'^L^^Z^^ ^ 0l " °' *» « - «•*■"« and (the result of, its 

tiE^ESi* a speci,te layer (a ^ used in 

^SXJ^ a — — , aye r (i m plici „ y 
3.95 profile: A subset of the syntax of this part of ISO/IEC 14496, defined in terms of Visual Object Types.

3.96 progressive: The property of film frames where all the samples of the frame represent the same
instances in time.

3.97 quantisation matrix: A set of sixty-four 8-bit values used by the dequantiser. 

3.98 quantised DCT coefficients: DCT coefficients before dequantisation. A variable length coded
representation of quantised DCT coefficients is transmitted as part of the coded video bitstream.

3.99 quantiser scale: A scale factor coded in the bitstream and used by the decoding process to scale the
dequantisation.

3.100 random access: The process of beginning to read and decode the coded bitstream at an arbitrary 
point. 

3.101 reconstructed VOP: A reconstructed VOP consists of three matrices of 8-bit numbers representing the 
luminance and two chrominance signals. It is obtained by decoding a coded VOP. 

3.102 reference VOP: A reference VOP is a reconstructed VOP that was coded in the form of a coded I-VOP
or a coded P-VOP. Reference VOPs are used for forward and backward prediction when P-VOPs
and B-VOPs are decoded.

3.103 reordering delay: A delay in the decoding process that is caused by VOP reordering. 

3.104 reserved: The term "reserved" when used in the clauses defining the coded bitstream indicates that 
the value may be used in the future for ISO/IEC defined extensions. 

3.105 scalable hierarchy: coded video data consisting of an ordered set of more than one video bitstream. 

3.106 scalability: Scalability is the ability of a decoder to decode an ordered set of bitstreams to produce a 
reconstructed sequence. Moreover, useful video is output when subsets are decoded. The minimum 
subset that can thus be decoded is the first bitstream in the set which is called the base layer. Each of 
the other bitstreams in the set is called an enhancement layer. When addressing a specific 
enhancement layer, "lower layer" refers to the bitstream that precedes the enhancement layer. 

3.107 side information: Information in the bitstream necessary for controlling the decoder. 

3.108 run: The number of zero coefficients preceding a non-zero coefficient, in the scan order. The absolute 
value of the non-zero coefficient is called "level". 

3.109 S-VOP: A picture that is coded using information obtained by warping whole or part of a static sprite. 

3.110 saturation: Limiting a value that exceeds a defined range by setting its value to the maximum or 
minimum of the range as appropriate. 

3.111 source; input: Term used to describe the video material or some of its attributes before encoding.

3.112 spatial prediction: prediction derived from a decoded frame of the reference layer decoder used in 
spatial scalability. 

3.113 spatial scalability: A type of scalability where an enhancement layer also uses predictions from 
sample data derived from a lower layer without using motion vectors. The layers can have different 
VOP sizes or VOP rates. 

3.114 static sprite: The luminance, chrominance and binary alpha plane for an object which does not vary in 
time. 
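Definitions 3.108 (run) and 3.110 (saturation) can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the function names `run_level_pairs` and `saturate` are hypothetical, the input list is assumed to be already in scan order, and the sign is carried separately because the definition describes "level" as an absolute value. It is not the normative variable length coding process.

```python
def run_level_pairs(scanned):
    """Turn coefficients (already in scan order) into (run, level, sign) events.

    run   = number of zero coefficients preceding a non-zero coefficient
    level = absolute value of that non-zero coefficient
    sign  = +1 or -1 (kept separately; an assumption of this sketch)
    """
    events = []
    run = 0
    for coeff in scanned:
        if coeff == 0:
            run += 1
        else:
            events.append((run, abs(coeff), -1 if coeff < 0 else 1))
            run = 0
    # Trailing zeros produce no event in this sketch.
    return events


def saturate(value, low, high):
    """Limit a value to the range [low, high]."""
    return max(low, min(high, value))


if __name__ == "__main__":
    print(run_level_pairs([12, 0, 0, -3, 0, 1, 0, 0]))
    print(saturate(300, 0, 255), saturate(-5, 0, 255))
```

The trailing zeros in the example list do not generate a run/level event; how the end of a block is signalled is outside the scope of this sketch.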
2«--. — , „ . mbtoci ^ ^ m ^ ^ ^

5.2.7 Definition of transparent_block() function
5.3 Reserved, forbidden and marker_bit

S »7 ' ' « - - * »• of some values oi _ ,„ „ 

^^r. 1 * — 

The term "zero_bit" indicates a one bit integer with the value zero.
5.4 Arithmetic precision 

6 Visual bitstream syntax and semantics
6.1 Structure of coded visual data 

a^SamlX °' di " efem ^P 88 ' such a * **> data, still texture data. 2D mesh data or facia. 



(a) 
(b) 
Visual texture, referred to herein as still texture coding, is designed for maintaining high visual quality in the
transmission and rendering of texture under widely varied viewing conditions typical of interaction with 2D/3D
synthetic scenes. Still texture coding provides for a multi-layer representation of luminance, color and shape. This
supports progressive transmission of the texture for image build-up as it is received by a terminal. Also supported is
the downloading of the texture resolution hierarchy for construction of image pyramids used by 3D graphics APIs.

Quality and SNR scalability are supported by the structure of still texture coding. 

Coded mesh data consists of a single non-scalable bitstream. This bitstream defines the structure and motion of a
2D mesh object. Texture that is to be mapped onto the mesh geometry is coded separately. 

?:rrr seises zzr^zz .r^n » . ~~ <-» », 

insertion in the face node. 

6.1.1 Visual object sequence

Visual object sequence is the highest syntactic structure of the coded visual bitstream. 

visual_object_sequence_end_code. 

6.1.2 Visual object 

A visual object commences with a visual_object_start_code, is followed by profile and level identification, and a
visual object id, and is followed by a video object, a still texture object, a mesh object, or a face object.

6.1.3 Video object

A video object commences with a video_object_start_code, and is followed by one or more video object layers.

6.1.3.1 Progressive and Interlaced sequences

This part of ISO/IEC 14496 deals with coding of both progressive and interlaced sequences. 

The sequence, at the output of the decoding process, consists of a series of reconstructed VOPs separated in time 

and are readied for display via the compositor. 

6.1.3.2 Frame

A frame consists of three rectangular matrices of integers; a luminance matrix (Y), and two chrominance matrices 
(Cb and Cr). 

6.1.3.3 VOP 

A reconstructed VOP is obtained by decoding a coded VOP. A coded VOP may have been derived from either a 
progressive or interlaced frame. 

6.1.3.4 VOP types 

There are four types of VOPs that use different coding methods: 

1. An intra-coded (I) VOP is coded using information only from itself.
2. A Predictive-coded (P) VOP is a VOP which is coded using motion compensated prediction from a past
reference VOP.

3. A Bidirectionally predictive-coded (B) VOP is a VOP which is coded using motion compensated prediction from
a past and/or future reference VOP(s).

4. A sprite (S) VOP is a VOP for a sprite object.
6.1.3.5 I-VOPs and group of VOPs

l-VOPs are intended to ass/ct ro^w 

« - ,.. th» FuN second „„„ „ „ ,„ „„ „, „ _ w _ ^ ^ ^ ^ 

6.1.3.6 Format 

^>=ttttStt^^-~ „, 

6- shows „ „ ^ ^ ^ ^ ^ _ J- 
Figure 6-1 - The position of luminance and chrominance samples in 4:2:0 data
Figure 6-2 - Vertical and temporal positions of samples in an interlaced frame with top_field_first=1
Figure 6-3 - Vertical and temporal positions of samples in an interlaced frame with top_field_first=0
Figure 6-4 - Vertical and temporal positions of samples in a progressive frame

The binary alpha plane for each VOP is represented by means of a bounding rectangle as described in clause F.2
and it always has the same number of lines and pixels per line as the luminance plane of the VOP bounding
rectangle. The positions between the luminance and chrominance pixels of the bounding rectangle are defined in
this clause according to the 4:2:0 format. For the progressive case, each 2x2 block of luminance pixels in the
bounding rectangle associates to one chrominance pixel. For the interlaced case, each 2x2 block of luminance
pixels of the same field in the bounding rectangle associates to one chrominance pixel of that field.

In order to perform the padding process on the two chrominance planes, it is necessary to generate a binary alpha 
plane which has the same number of lines and pixels per line as the chrominance planes. Therefore when non- 
scalable shape coding is used, this binary alpha plane associated with the chrominance planes is created from the 
binary alpha plane associated with the luminance plane by the subsampling process defined below: 

For each 2x2 block of the binary alpha plane associated with the luminance plane of the bounding rectangle (of the 
same frame for the progressive and of the same field for the interlaced case), the associated pixel value of the 
binary alpha plane associated with the chrominance planes is set to 255 if any pixel of said 2x2 block of the binary 
alpha plane associated with the luminance plane equals 255. 
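The subsampling rule above can be sketched as follows for the progressive case. This is a minimal illustration, not the normative process: the function name `subsample_alpha` is hypothetical, the plane is assumed to have even dimensions, and the value 0 for the "no opaque pixel" case is an assumption, since the text above only states when the chrominance alpha value is set to 255.

```python
def subsample_alpha(luma_alpha):
    """Derive the chrominance binary alpha plane from the luminance one.

    luma_alpha: list of rows, each value 0 or 255, even number of rows
    and columns (assumption of this sketch, progressive case only).
    """
    rows = len(luma_alpha)
    cols = len(luma_alpha[0])
    chroma_alpha = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            block = (
                luma_alpha[r][c], luma_alpha[r][c + 1],
                luma_alpha[r + 1][c], luma_alpha[r + 1][c + 1],
            )
            # 255 if any pixel of the 2x2 block is 255; 0 otherwise (assumed).
            out_row.append(255 if 255 in block else 0)
        chroma_alpha.append(out_row)
    return chroma_alpha


if __name__ == "__main__":
    plane = [
        [0, 0, 0, 255],
        [0, 0, 0, 0],
        [0, 0, 255, 255],
        [0, 0, 255, 255],
    ]
    print(subsample_alpha(plane))
```

For the interlaced case the same rule would be applied to 2x2 blocks taken from lines of the same field, which this sketch does not cover.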

6.1.3.7 VOP reordering 

When a video object layer contains coded B-VOPs, the number of consecutive coded B-VOPs is variable and
unbounded. The first coded VOP shall not be a B-VOP.

A video object layer may contain no coded P-VOPs. A video object layer may also contain no coded I-VOPs in
which case some care is required at the start of the video object layer and within the video object layer to effect both
random access and error recovery.

The order of the coded VOPs in the bitstream, also called decoding order, is the order in which a decoder
reconstructs them. The order of the reconstructed VOPs at the output of the decoding process, also called the
display order, is not always the same as the decoding order and this subclause defines the rules of VOP reordering
that shall happen within the decoding process.

When the video object layer contains no coded B-VOPs, the decoding order is the same as the display order.
When B-VOPs are present in the video object layer re-ordering is performed according to the following rules:

If the current VOP in decoding order is a B-VOP the output VOP is the VOP reconstructed from that B-VOP.

If the current VOP in decoding order is an I-VOP or P-VOP the output VOP is the VOP reconstructed from the
previous I-VOP or P-VOP if one exists. If none exists, at the start of the video object layer, no VOP is output.

The following is an example of VOPs taken from the beginning of a video object layer. In this example there are two
coded B-VOPs between successive coded P-VOPs and also two coded B-VOPs between successive coded I- and
P-VOPs. VOP '1I' is used to form a prediction for VOP '4P'. VOPs '4P' and '1I' are both used to form predictions for
VOPs '2B' and '3B'. Therefore the order of coded VOPs in the coded sequence shall be '1I', '4P', '2B', '3B'.
However, the decoder shall display them in the order '1I', '2B', '3B', '4P'.

At the encoder input,

 1  2  3  4  5  6  7  8  9 10 11 12 13
 I  B  B  P  B  B  P  B  B  I  B  B  P

At the encoder output, in the coded bitstream, and at the decoder input,

 1  4  2  3  7  5  6 10  8  9 13 11 12
 I  P  B  B  P  B  B  I  B  B  P  B  B

At the decoder output,

 1  2  3  4  5  6  7  8  9 10 11 12 13
 I  B  B  P  B  B  P  B  B  I  B  B  P
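The reordering rules and the 13-VOP example above can be reproduced with a small sketch. The Python below is illustrative only: the function names and the `(number, type)` tuple representation are hypothetical, and the final flush of the last held reference VOP at the end of the sequence is an assumption of this sketch, since the rules above only describe what is output when each coded VOP arrives.

```python
def display_to_coded(display):
    """Encoder side: move each I/P-VOP ahead of the B-VOPs that precede it."""
    coded, pending_b = [], []
    for vop in display:
        if vop[1] == "B":
            pending_b.append(vop)      # wait for the next reference VOP
        else:
            coded.append(vop)          # reference VOP is coded first
            coded.extend(pending_b)
            pending_b = []
    coded.extend(pending_b)            # leftover B-VOPs, if any (assumed)
    return coded


def coded_to_display(coded):
    """Decoder side: apply the output rules for B-VOPs and I/P-VOPs."""
    output, held = [], None
    for vop in coded:
        if vop[1] == "B":
            output.append(vop)         # a B-VOP is output immediately
        else:
            if held is not None:
                output.append(held)    # output the previous I- or P-VOP
            held = vop
    if held is not None:
        output.append(held)            # final flush (assumption)
    return output


if __name__ == "__main__":
    display = [(i + 1, t) for i, t in enumerate("IBBPBBPBBIBBP")]
    coded = display_to_coded(display)
    print([n for n, _ in coded])
    print([n for n, _ in coded_to_display(coded)])
```

Running the sketch on the encoder input of the example yields the coded order 1 4 2 3 7 5 6 10 8 9 13 11 12, and feeding that order to the decoder-side function restores the display order 1 to 13.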



6.1.3.8 Macroblock 



A macroblock contains a section of the luminance component and the spatially corresponding chrominance
components. The term macroblock can either refer to source and decoded data or to the corresponding coded data
elements. A skipped macroblock is one for which no information is transmitted. Presently there is only one
chrominance format for a macroblock, namely, 4:2:0 format. The order of blocks in a macroblock is illustrated
below:

A 4:2:0 Macroblock consists of 6 blocks. This structure holds 4 Y, 1 Cb and 1 Cr Blocks and the block order is
depicted in Figure 6-5.



 0  1
 2  3        □        □
  Y          Cb       Cr

Figure 6-5 - 4:2:0 Macroblock structure
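The 4:2:0 macroblock structure can be sketched in code as follows. This is a minimal illustration with assumptions: the function name `split_macroblock` is hypothetical, the luminance area is taken to be 16x16 samples with 8x8 chrominance blocks, and the four Y blocks are assumed to be numbered in raster order (top-left, top-right, bottom-left, bottom-right) followed by Cb and Cr.

```python
def split_macroblock(y, cb, cr):
    """Return the six 8x8 blocks of a 4:2:0 macroblock.

    y      : 16x16 luminance samples (list of 16 rows of 16 values)
    cb, cr : 8x8 chrominance samples
    Order  : Y0, Y1, Y2, Y3, Cb, Cr (raster order for Y is assumed).
    """
    def sub_block(plane, top, left):
        return [row[left:left + 8] for row in plane[top:top + 8]]

    return [
        sub_block(y, 0, 0),   # block 0: top-left luminance
        sub_block(y, 0, 8),   # block 1: top-right luminance
        sub_block(y, 8, 0),   # block 2: bottom-left luminance
        sub_block(y, 8, 8),   # block 3: bottom-right luminance
        cb,                   # Cb block
        cr,                   # Cr block
    ]


if __name__ == "__main__":
    y = [[r * 16 + c for c in range(16)] for r in range(16)]
    cb = [[128] * 8 for _ in range(8)]
    cr = [[128] * 8 for _ in range(8)]
    blocks = split_macroblock(y, cb, cr)
    print(len(blocks), blocks[1][0][0], blocks[2][0][0])
```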

The organisation of VOPs into macroblocks is as follows.

For the case of a progressive VOP, the interlaced flag (in the VOP header) is set to "0" and the organisation of lines
of luminance VOP into macroblocks is called frame organization and is illustrated in Figure 6-6. In this case, frame
DCT coding is employed.

For the case of interlaced VOP, the interlaced flag is set to "1" and the organisation of lines of luminance VOP into
macroblocks can be either frame organization or field organization and thus both frame and field DCT coding may
be used in the VOP.
BIFS syntax (see ISO/IEC 14496-1). The FDP node defines the face model to be used at the receiver. Two options
are supported:

• calibration information is downloaded so that the proprietary face of the receiver can be configured using facial
feature points and optionally a 3D mesh or texture.

• a face model is downloaded with the animation definition of the Facial Animation Parameters. This face model
replaces the proprietary face model in the receiver.

6.2 Visual bitstream syntax 

6.2.1 Start codes 

Start codes are specific bit patterns that do not otherwise occur in the video stream. 

Each start code consists of a start code prefix followed by a start code value. The start code prefix is a string of 
twenty three bits with the value zero followed by a single bit with the value one. The start code prefix is thus the bit 
string '0000 0000 0000 0000 0000 0001'.

The start code value is an eight bit integer which identifies the type of start code. Many types of start code have just 
one start code value. However video_object_start_code and video_object_layer_start_code are represented by 
many start code values. 

All start codes shall be byte aligned. This shall be achieved by first inserting a bit with the value zero and then, if 
necessary, inserting bits with the value one before the start code prefix such that the first bit of the start code prefix 
is the first (most significant) bit of a byte. For stuffing of 1 to 8 bits, the codewords are as follows in Table 6-2. 



Table 6-2 - Stuffing codewords

Bits to be stuffed    Stuffing Codeword
1                     0
2                     01
3                     011
4                     0111
5                     01111
6                     011111
7                     0111111
8                     01111111
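The stuffing rule (one zero bit, then as many one bits as needed to reach a byte boundary) can be sketched as below. The function name `stuffing_codeword` and the convention that `bit_position` counts the bits already written are assumptions of this illustration; when the stream is already byte aligned the sketch emits the full 8-bit codeword, matching the last row of Table 6-2.

```python
def stuffing_codeword(bit_position):
    """Return the stuffing bits needed before a start code prefix.

    bit_position: number of bits already written to the bitstream
    (hypothetical convention for this sketch).
    """
    bits_to_stuff = 8 - (bit_position % 8)   # always in the range 1..8
    return "0" + "1" * (bits_to_stuff - 1)


if __name__ == "__main__":
    for position in (0, 5, 7, 13):
        print(position, stuffing_codeword(position))
```

In every case the length of the codeword added to `bit_position` is a multiple of 8, so the start code prefix that follows begins on the first bit of a byte.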



Table 6-3 defines the start code values for all start codes used in the visual bitstream. 



Table 6-3 - Start code values

name                                 start code value (hexadecimal)
video_object_start_code              00 through 1F
video_object_layer_start_code        20 through 2F
reserved                             30 through AF
visual_object_sequence_start_code    B0
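A decoder typically locates start codes by searching for the byte-aligned prefix and then reading the eight bit start code value. The sketch below is an illustration only: the function names are hypothetical, and the classification covers just the rows of Table 6-3 reproduced above; any other value is reported as not listed here rather than guessed.

```python
START_CODE_PREFIX = b"\x00\x00\x01"   # 23 zero bits followed by a one bit


def classify_start_code(value):
    """Map a start code value to a name, using only the rows shown above."""
    if 0x00 <= value <= 0x1F:
        return "video_object_start_code"
    if 0x20 <= value <= 0x2F:
        return "video_object_layer_start_code"
    if 0x30 <= value <= 0xAF:
        return "reserved"
    if value == 0xB0:
        return "visual_object_sequence_start_code"
    return "not listed in this excerpt"


def find_start_codes(data):
    """Return (byte offset of prefix, start code value) pairs found in data."""
    found = []
    pos = data.find(START_CODE_PREFIX)
    while pos != -1 and pos + 3 < len(data):
        found.append((pos, data[pos + 3]))
        pos = data.find(START_CODE_PREFIX, pos + 3)
    return found


if __name__ == "__main__":
    stream = bytes([0, 0, 1, 0xB0, 0xAA, 0, 0, 1, 0x05, 0, 0, 1, 0x21])
    for offset, value in find_start_codes(stream):
        print(offset, hex(value), classify_start_code(value))
```

A byte-wise search is sufficient in this sketch because all start codes are byte aligned.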
Figure 6-10 - The Facial Animation Parameter Units (figure labels: ES0, IRISD0, ENS0, MNS0, MW0)
6.1.5.4 Description of a neutral face 

At the beginning of a sequence, the face is supposed to be in a neutral position. Zero values of the FAPs
correspond to a neutral face. All FAPs are expressed as displacements from the positions defined in the neutral
face. The neutral face is defined as follows:

• the coordinate system is right-handed; head axes are parallel to the world axes

• gaze is in direction of Z axis

• all face muscles are relaxed

• eyelids are tangent to the iris

• the pupil is one third of IRISD0

• lips are in contact; the line of the lips is horizontal and at the same height of lip corners

• the mouth is closed and the upper teeth touch the lower ones

• the tongue is flat, horizontal with the tip of tongue touching the boundary between upper and lower teeth
6.1.5.5 Facial definition parameter set 

The FDPs are normally transmitted once per session, followed by a stream of compressed FAPs.
Figure 6-11 - Example Visual Information - Logical Structure
(Visual Object Sequence Header; VO 1 Header and VO 2 Header; VO 1 VOL 1 Header, VO 1 VOL 2 Header and
VO 2 VOL 1 Header; Elementary Streams for Visual Object 1 Layer 1, Visual Object 1 Layer 2 and Visual Object 2
Layer 1)



Figure 6-12 - Example Visual Bitstream - Separate Configuration Information / Elementary Stream
(configuration information: Visual Object Sequence Header, VO 1 Header, VO 1 VOL 1 Header, VO 1 VOL 2 Header,
VO 2 Header, VO 2 VOL 1 Header; separate Elementary Streams for Visual Object 1 Layer 1, Visual Object 1 Layer 2
and Visual Object 2 Layer 1)



Figure 6-13 - Example Visual Bitstream - Combined Configuration Information / Elementary Stream
(three streams, each carrying its own configuration information:
Visual Object Sequence Header, VO 1 Header, VO 1 VOL 1 Header, Elementary Stream Visual Object 1 Layer 1;
Visual Object Sequence Header, VO 1 Header, VO 1 VOL 2 Header, Elementary Stream Visual Object 1 Layer 2;
Visual Object Sequence Header, VO 2 Header, VO 2 VOL 1 Header, Elementary Stream Visual Object 2 Layer 1)

L» e a ^^ fCr e,ementar Y strea ™. ™* entry into these functions defines the 

breakpoint between configuration information and elementary streams: 

1. Group_of_VideoObjectPlane(),

2. VideoObjectPlane(),

3. video_plane_with_short_header(),

4. MeshObject(),

-j hu i«;o/IEC 14496-1, configuration 

SSSESSSE S^^"" 1 - ■ " " Wom " ,i ™' * 

in other parts of the Systems bitstream. 

2 combined Configuration /Elemental Streams 

Th e e.ementa-y stream d a,a - a sin,e -ayer ^ wrapp* ^ 

object must be identical. 



6.2.2 Visual Object Sequence and Visual Object 




VisualObject() {                                              No. of bits   Mnemonic
  visual_object_start_code                                    32            bslbf
  is_visual_object_identifier                                 1             uimsbf
  if (is_visual_object_identifier) {
    visual_object_verid                                       4             uimsbf
    visual_object_priority                                    3             uimsbf
  }
  visual_object_type                                          4             uimsbf
  if (visual_object_type == "video ID" || visual_object_type == "still texture ID") {
    video_signal_type()
  }
  next_start_code()
  while (next_bits() == user_data_start_code) {
    user_data()
  }
  if (visual_object_type == "video ID") {
    video_object_start_code                                   32            bslbf
    VideoObjectLayer()
  }
  else if (visual_object_type == "still texture ID") {
    StillTextureObject()
  }
  else if (visual_object_type == "mesh ID") {
    MeshObject()
  }
  else if (visual_object_type == "face ID") {
    FaceObject()
  }
  if (next_bits() != "0000 0000 0000 0000 0000 0001")
    next_start_code()
}
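The first fields of VisualObject() can be read with a simple most-significant-bit-first bit reader. The sketch below is only an illustration under assumptions: the `BitReader` class and `parse_visual_object_header` function are hypothetical names, only the fixed-width fields listed in the syntax above are parsed, and the numeric codes behind "video ID", "still texture ID" and so on are not given in this excerpt, so visual_object_type is returned as a raw integer.

```python
class BitReader:
    """Reads unsigned integers MSB first from a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position in bits

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def parse_visual_object_header(data):
    """Parse the leading fields of VisualObject() into a dict."""
    reader = BitReader(data)
    fields = {"visual_object_start_code": reader.read(32)}
    fields["is_visual_object_identifier"] = reader.read(1)
    if fields["is_visual_object_identifier"]:
        fields["visual_object_verid"] = reader.read(4)
        fields["visual_object_priority"] = reader.read(3)
    fields["visual_object_type"] = reader.read(4)
    return fields


if __name__ == "__main__":
    # The start code value byte (0xB5) is a placeholder here: the value for
    # visual_object_start_code is not shown in this excerpt of Table 6-3.
    sample = bytes([0x00, 0x00, 0x01, 0xB5, 0x89, 0x10])
    print(parse_visual_object_header(sample))
```

Parsing of video_signal_type(), user data and the object-specific syntax would follow from the reader position left after these fields.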














video_signal_type() {                                         No. of bits   Mnemonic
  video_signal_type                                           1             bslbf
  if (video_signal_type) {
    video_format                                              3             uimsbf
    video_range                                               1             bslbf
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colour.description 

if (cotour_description) ( 



colour_primaries 



transfer_characteristlcs 



matrix_coetficients 



uimsbf 



uimsbf 
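The nested conditions of video_signal_type() can be sketched in Python as follows. The function name `parse_video_signal_type` is hypothetical, the input is assumed to start exactly at the first bit of the structure, and the 8-bit width used for the three colour fields is an assumption of this sketch. Fields that are absent from the bitstream are simply omitted from the result; no default values are applied.

```python
def parse_video_signal_type(data: bytes) -> dict:
    """Read video_signal_type() MSB first from the start of data."""
    pos = 0  # current position, in bits

    def read(n: int) -> int:
        nonlocal pos
        value = 0
        for _ in range(n):
            value = (value << 1) | ((data[pos >> 3] >> (7 - (pos & 7))) & 1)
            pos += 1
        return value

    fields = {"video_signal_type": read(1)}
    if fields["video_signal_type"]:
        fields["video_format"] = read(3)
        fields["video_range"] = read(1)
        fields["colour_description"] = read(1)
        if fields["colour_description"]:
            # Assumed widths: 8 bits each for the colour description fields.
            fields["colour_primaries"] = read(8)
            fields["transfer_characteristics"] = read(8)
            fields["matrix_coefficients"] = read(8)
    return fields
```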



6.2.2.1 User data 



user_data() {                                     No. of bits  Mnemonic
  user_data_start_code                            32           bslbf
  while (next_bits() != '0000 0000 0000 0000 0000 0001') {
    user_data                                     8            uimsbf
  }
  next_start_code()
}
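The loop above collects bytes until the next 24-bit start code prefix appears. A minimal Python sketch of that behaviour is given below. The name `read_user_data` is hypothetical and the value of `USER_DATA_START_CODE` is assumed from the start code table of the standard. The sketch works on byte-aligned input only and does not model next_start_code() stuffing.

```python
# Assumed value of user_data_start_code (not stated in this subclause).
USER_DATA_START_CODE = b"\x00\x00\x01\xb2"
START_CODE_PREFIX = b"\x00\x00\x01"


def read_user_data(data: bytes, offset: int = 0) -> bytes:
    """Return the user_data bytes that follow the start code at offset."""
    if data[offset:offset + 4] != USER_DATA_START_CODE:
        raise ValueError("user_data_start_code not found at offset")
    start = offset + 4
    # user_data runs until the next start code prefix, or to the end of input.
    end = data.find(START_CODE_PREFIX, start)
    return data[start:] if end == -1 else data[start:end]
```

The payload is returned as raw bytes because the syntax table assigns no structure to it.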



6.2.3 Video Object Layer 



VideoObjectLayer() {                              No. of bits  Mnemonic
  if (next_bits() == video_object_layer_start_code) {
    short_video_header = 0
    video_object_layer_start_code                 32           bslbf
    random_accessible_vol                         1            bslbf
    video_object_type_indication                  8            uimsbf
    is_object_layer_identifier                    1            uimsbf
    if (is_object_layer_identifier) {
      video_object_layer_verid                    4            uimsbf
      video_object_layer_priority                 3            uimsbf
    }
    aspect_ratio_info                             4            uimsbf
    if (aspect_ratio_info == "extended_PAR") {
      par_width                                   8            uimsbf
      par_height                                  8            uimsbf
    }
    vol_control_parameters                        1            bslbf




    if (vol_control_parameters) {
      chroma_format                               2            uimsbf
      low_delay                                   1            uimsbf
      vbv_parameters                              1            bslbf
      if (vbv_parameters) {
        first_half_bit_rate                       15           uimsbf
        marker_bit                                1            bslbf
        latter_half_bit_rate                      15           uimsbf
        marker_bit                                1            bslbf
        first_half_vbv_buffer_size                15           uimsbf
        marker_bit                                1            bslbf
        latter_half_vbv_buffer_size               3            uimsbf
        first_half_vbv_occupancy                  11           uimsbf
        marker_bit                                1            bslbf
        latter_half_vbv_occupancy                 15           uimsbf
        marker_bit                                1            bslbf
      }
    }
    video_object_layer_shape                      2            uimsbf
    marker_bit                                    1            bslbf
    vop_time_increment_resolution                 16           uimsbf
    marker_bit                                    1            bslbf
    fixed_vop_rate                                1            bslbf
    if (fixed_vop_rate)
      fixed_vop_time_increment                    1-16         uimsbf
    if (video_object_layer_shape != "binary only") {
      if (video_object_layer_shape == "rectangular") {
        marker_bit                                1            bslbf
        video_object_layer_width                  13           uimsbf
        marker_bit                                1            bslbf
        video_object_layer_height                 13           uimsbf
        marker_bit                                1            bslbf
      }
      interlaced                                  1            bslbf
      obmc_disable                                1            bslbf
      sprite_enable                               1            bslbf
      if (sprite_enable) {
        sprite_width                              13           uimsbf
        marker_bit                                1            bslbf
        sprite_height                             13           uimsbf
        marker_bit                                1            bslbf
        sprite_left_coordinate                    13           simsbf
        marker_bit                                1            bslbf
        sprite_top_coordinate                     13           simsbf
        marker_bit                                1            bslbf
        no_of_sprite_warping_points               6            uimsbf
        sprite_warping_accuracy                   2            uimsbf
        sprite_brightness_change                  1            bslbf
        low_latency_sprite_enable                 1            bslbf
      }
      not_8_bit                                   1            bslbf
      if (not_8_bit) {
        quant_precision                           4            uimsbf
        bits_per_pixel                            4            uimsbf
      }
      if (video_object_layer_shape == "grayscale") {
        no_gray_quant_update                      1            bslbf
        composition_method                        1            bslbf
        linear_composition                        1            bslbf
      }
      quant_type                                  1            bslbf
      if (quant_type) {
        load_intra_quant_mat                      1            bslbf
        if (load_intra_quant_mat)
          intra_quant_mat                         8*[2-64]     uimsbf
        load_nonintra_quant_mat                   1            bslbf
        if (load_nonintra_quant_mat)
          nonintra_quant_mat                      8*[2-64]     uimsbf
        if (video_object_layer_shape == "grayscale") {
          load_intra_quant_mat_grayscale          1            bslbf




ISO/IEC 14496-2: 1999(E)



© ISO/IEC 



          if (load_intra_quant_mat_grayscale)
            intra_quant_mat_grayscale             8*[2-64]     uimsbf
          load_nonintra_quant_mat_grayscale       1            bslbf
          if (load_nonintra_quant_mat_grayscale)
            nonintra_quant_mat_grayscale          8*[2-64]     uimsbf
        }
      }
      complexity_estimation_disable               1            bslbf
      if (!complexity_estimation_disable)
        define_vop_complexity_estimation_header()
      resync_marker_disable                       1            bslbf
      data_partitioned                            1            bslbf
      if (data_partitioned)
        reversible_vlc                            1            bslbf
      scalability                                 1            bslbf
      if (scalability) {
        hierarchy_type                            1            bslbf
        ref_layer_id                              4            uimsbf
        ref_layer_sampling_direc                  1            bslbf
        hor_sampling_factor_n                     5            uimsbf
        hor_sampling_factor_m                     5            uimsbf
        vert_sampling_factor_n                    5            uimsbf
        vert_sampling_factor_m                    5            uimsbf
        enhancement_type                          1            bslbf
      }
    }
    else
      resync_marker_disable                       1            bslbf
    next_start_code()
    while (next_bits() == user_data_start_code) {
      user_data()
    }
    if (sprite_enable && !low_latency_sprite_enable)
      VideoObjectPlane()
    do {
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